Automatically Identifying Pseudepigraphic Texts

Authors

  • Moshe Koppel
  • Shachar Seidman
Abstract

The identification of pseudepigraphic texts – texts not written by the authors to whom they are attributed – has important historical, forensic and commercial applications. We introduce an unsupervised technique for identifying pseudepigrapha. The idea is to identify textual outliers in a corpus based on the pairwise similarities of all documents in the corpus. The crucial point is that document similarity not be measured in any of the standard ways but rather be based on the output of a recently introduced algorithm for authorship verification. The proposed method strongly outperforms existing techniques in systematic experiments on a blog corpus.
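
The outlier idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, and the word-overlap measure `jaccard_words` is only a stand-in for the authorship-verification score the abstract calls for, which is not specified here. The abstract argues against exactly this kind of standard measure, so the sketch shows only the surrounding plumbing: compute all pairwise similarities, then rank each document by its mean similarity to the rest of the corpus.

```python
from typing import Callable, List, Sequence

# Hypothetical type for any pairwise similarity function (higher = more alike).
Similarity = Callable[[str, str], float]


def pairwise_similarities(docs: Sequence[str], sim: Similarity) -> List[List[float]]:
    """Build a symmetric matrix of similarities between all document pairs."""
    n = len(docs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            score = sim(docs[i], docs[j])
            matrix[i][j] = score
            matrix[j][i] = score
    return matrix


def mean_similarity(matrix: List[List[float]]) -> List[float]:
    """Mean similarity of each document to every other document."""
    n = len(matrix)
    if n < 2:
        raise ValueError("need at least two documents")
    return [
        sum(matrix[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]


def rank_outliers(docs: Sequence[str], sim: Similarity) -> List[int]:
    """Document indices, least similar to the rest of the corpus first."""
    scores = mean_similarity(pairwise_similarities(docs, sim))
    return sorted(range(len(docs)), key=lambda i: scores[i])


def jaccard_words(a: str, b: str) -> float:
    """Placeholder similarity (word-set overlap). An assumption for
    illustration only; the paper uses an authorship-verification score."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "the cat lay on the mat",
        "the cat sat on the rug",
        "quantum flux capacitor overload imminent",
    ]
    print(rank_outliers(corpus, jaccard_words))
```

Because the similarity function is passed in as a parameter, the placeholder can be swapped for any other pairwise score without touching the ranking code.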


Similar resources

Babylon Parallel Text Builder: Gathering Parallel Texts for Low-Density Languages

This paper describes BABYLON, a system that attempts to overcome the shortage of parallel texts in low-density languages by supplementing existing parallel texts with texts gathered automatically from the Web. In addition to the identification of entire Web pages, we also propose a new feature specifically designed to find parallel text chunks within a single document. Experiments carried out o...


A Surface-Based Approach To Identifying Discourse Markers And Elementary Textual Units In Unrestricted Texts

I present a surface-based algorithm that employs knowledge of cue phrase usages in order to determine automatically clause boundaries and discourse markers in unrestricted natural language texts. The knowledge was derived from a comprehensive corpus analysis.


Identifying Opinion Holders for Question Answering in Opinion Texts

Question answering in opinion texts has so far mostly concentrated on the identification of opinions and on analyzing the sentiment expressed in opinions. In this paper, we address another important part of Question Answering (QA) in opinion texts: finding opinion holders. Holder identification is a central part of full opinion identification and can be used independently to answer several opin...


IXAGroupEHUDiac: A Multiple Approach System towards the Diachronic Evaluation of Texts

This paper presents our contribution to the SemEval-2015 Task 7. The task was subdivided into three subtasks that consisted of automatically identifying the time period when a piece of news was written (1,2) as well as automatically determining whether a specific phrase in a sentence is relevant or not for a given period of time (3). Our system tackles the resolution of all three subtasks. With...


Detecting Information-Dense Texts in Multiple News Domains

We introduce the task of identifying information-dense texts, which report important factual information in direct, succinct manner. We describe a procedure that allows us to label automatically a large training corpus of New York Times texts. We train a classifier based on lexical, discourse and unlexicalized syntactic features and test its performance on a set of manually annotated articles f...




Publication date: 2013